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Abstract 

We address the issue of knots selection for Gaussian predictive process methodology. 
Predictive process approximation provides an effective solution to the cubic order com- 
putational complexity of Gaussian process models. This approximation crucially depends 
on a set of points, called knots, at which the original process is retained, while the rest is 
approximated via a deterministic extrapolation. Knots should be few in number to keep 
the computational complexity low, but provide a good coverage of the process domain 
to limit approximation error. We present theoretical calculations to show that coverage 
must be judged by the canonical metric of the Gaussian process. This necessitates hav- 
ing in place a knots selection algorithm that automatically adapts to the changes in the 
canonical metric affected by changes in the parameter values controlling the Gaussian 
process covariance function. We present an algorithm toward this by employing an in- 
complete Cholesky factorization with pivoting and dynamic stopping. Although these 
concepts already exist in the literature, our contribution lies in unifying them into a fast 
algorithm and in using computable error bounds to finesse implementation of the predic- 
tive process approximation. The resulting adaptive predictive process offers a substantial 
automatization of Guassian process model fitting, especially for Bayesian applications 
where thousands of values of the covariance parameters are to be explored. 

Keywords: Gaussian predictive process. Knots selection, Cholesky factorization. Pivoting, 
Bayesian model fitting, Markov chain sampling. 



1 Introduction 



Bayesian nonparametric methodology is driven by construction of prior distributions on func- 
tion spaces. Toward this, Gaussian process distributions have proved extremely useful due to 
their mathematical and computational tractability and ability to incorporate a wide range of 
smoothness assumptions. Gaussian process models have been widely used in spatio-temporal 



modeling (Handcock and Stein , 1993 



lation (Kennedy and O'Hagan 



and classification (Neal 1998 Csato et al. 



2001 



Kim et al. 



2005 



2007), density and quantile regression (Tokdar et al. 



2000 



Banerjee et al. 2008), computer emu- 



Oakley and OHagan 2002), non-parametric regression 



Rasmussen and Williams , 2006 



2010 



tional data analysis (Shi and Wang, 2008; Petrone et al. , 2009), image analysis (Sudderth and 



Jordan, 2009), etc. Rasmussen and Williams (2006) give a thorough overview of likelihood 
based exploration of Gaussian process models, including Bayesian treatments. For theoreti- 



Tokdar and Kadane 



Short et al. 



2011), func- 



cal details on common Bayesian models based on Gaussian processes, see Tokdar and Ghosh 



(2007), Choi and Schervish (2007), Ghosal and Roy (2006), van der Vaart and van Zanten 



2008 



2009 ) and the references therein. 
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For our purpose, a Gaussian process can be viewed as a random, real valued function 
uj = {uj{t),t G T) on a compact Euclidean domain T, such that for any finitely many points 
ti, - ■ ■ ,tk G T the random vector {u}{ti), ■ ■ ■ ,u}{tk)) is a /c-dimensional Gaussian vector. A 
Gaussian process uj is completely characterized by its mean and covariance functions ^{t) = 
E[uj{t)] and tp{s, t) = Cov[uj{s),io{t)]. For a Gaussian process model, where a function valued 
parameter w is assigned a Gaussian process prior distributions, the data likelihood typically 
involves uj through a vector W = (a;(si),--- ,u){sj\f)) of u values at a finite set of points 
si, • • • , Sat G T. These points could possibly depend on other model parameters. The fact 
that is a Gaussian vector makes computation conceptually straightforward. 

However, a well known bottleneck in implementing Gaussian process models is the 0{N^) 
complexity of inverting or factorizing the NxN covariance matrix of W. Various reduced rank 



approximations to covariance matrices have been proposed to overcome this problem (Smola 



and Bartlettl 12001; Se eger et al.[ | 2003{|Schwaighofer and Tresp]|2003HQuihonero-Candela and 



Rasmussen 



2005g |Snelson and Ghahramani , 2006 ) , mostly reported in the machine learning 



literature. Among these, a special method of approximation, known as predictive process 
approximation (Tokdar, 2007, Banerjee et al. , 2008), has been independently discovered and 
successfully used in the Bayesian literature. The appeal of this method lies in a stochastic 
process representation of the approximation that obtains from tracking cj at a small number 
of points, called knots, and extrapolating the rest by using properties of Gaussian process 
laws (Section 2.1 ). 

For predictive process approximations, choosing the number and locations of the knots 
remains a difficult problem. This problem is only going to escalate as more complex Gaussian 
process models are used in hierarchical Bayesian modeling, with rich parametric and non- 
parametric formulations of the Gaussian process covariance function becoming commonplace. 
To understand this difficulty, we first lay out (Section 2.2 ) the basic probability theory behind 
the approximation accuracy of the predictive process and demonstrate how the choice of 
knots determines an accuracy bound. The key concept here is that the knots must provide 
a good coverage of the domain T of the Gaussian process. While this is intuitive, what 
needs emphasis is that the geometry of T is to be viewed through the topology induced by 
the Gaussian process canonical metric p{s,t) = [E{a;(s) — a;(t)}^]^/^, which could be quite 
different from the Euclidean geometry of T. 

This theory helps understand (Section 2.3) why existing approaches of choosing knots. 



based on space filling design concepts ( |Zhu and Stein , 2005 Zimmerman , 2006 Finley et al. 



2009 ) or model extensions where knots are learned from data as additional model parameters 



(Snelson and Ghahramani, 2006: Tokdar, 2007 Guhaniyogi et al. , 2010), are likely to offer 



poor approximation and face severe computational difficulties. A fundamental weakness of 
these approaches is their inability to automatically adapt to the changes in the geometry of 
T caused by changes in the values of the covariance parameters. 

In Section [3] we present a simple extension of the predictive process approximation that 
enables it to automatically adapt to the geometry of the Gaussian process covariance function. 
This extension, called adaptive predictive process approximation, works with an equivalent 
representation of the predictive process through reduced rank Cholesky factorization (Section 



3.1 ) and adds to it two adaptive features, pivoting and dynamic stopping. Pivoting determines 



the order in which knots are selected from an initial set of candidate points while dynamic 
stopping determines how many knots to select. The resulting approximation meets a pre- 
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specified accuracy bound (Section 3.2). 

The connection between predictive process approximation and reduced rank Cholesky 
factorization is well known ( Quinonero-Candela and Rasmussen, 2005) and pivoting has been 



recently investigated in this context from the point of view of stable computation (Foster 



et al. , 2009). The novelty of our work lies in unifying these ideas to define an adaptive 



version of the predictive process and in proposing accuracy bounds as the driving force 
in finessing the implementation of such approximation techniques. The end product is a 
substantial automatization of fitting Gaussian process models that can broaden up the scope 
of such models without the additional burden of having to model or learn the knots. This is 
illustrated with two examples in Section |4j 



2 Predictive process approximation 
2.1 Definition 

Fix a set of points, referred to as knots hereafter, {ti,t2,--- ,im} C T and write uj{t) = 
^{t) + ^(t), where, 

u{t) = E{co{t) I uo{h),--- ,uj{tm)} 

and ^{t) = uj{t) — E{uj{t)\u){ti), - • ■ ,u)(tm)}- By the properties of Gaussian process laws, 
1/ and ^ are independent Gaussian processes. The process i^, called a Gaussian predictive 
process, has rank m, because it can be written as i^(t) = YlT^i ^iV'(^j5^) with the coefficient 
vector A = (Ai, • • • , Am) being a Gaussian vector. By replacing uj with u in the statistical 
model, one now deals with the covariance matrix V = {1/(31), ■ ■ ■ ,t^isN)), which, due to 
the rank-m property of u, can be factorized in 0{Nm?) time. 



2.2 Accuracy bound 

Replacing oj with u can be justified as follows. Let 6 — sup^g-p mini<j<j7i p{t,ti) denote 
the mesh size of the knots, where p{t,s) = [E{uj{t) — u}{s)}'^]^^'^ is the canonical metric on 



T induced by u! (Adler, 1990, page 2). For a smooth 'ilj{t,s), S can be made arbitrarily 
small by packing T with sufficiently many, well placed knots. But, as 5 tends to 0, so does 
At^ = supigy Var{^(t)}. This is because for any t G T, and any i E {!,••• ,1^}, by the 
independence of 1/ and ^, 

Yai{C{t)} = Var{w(t)} - Var{z^(t)} 

= E[Var{a;(t)|L^(ti),--- ,t^(t„)}] 

= E[Var{w(t) - io{ti)\co{ti), ■ ■ ■ ,co{tm)}] 

<Yar{uj{t) - uj{ti)} = p'^{t,ti), 

and hence k < 5. That k can be made arbitrarily small is good news, because it plays a key 
role in providing probabilistic bounds on the residual process ^. 

Theorem 2.1. Let uj be a zero mean Gaussian process on a compact subset T C BP . Let v 
he a finite rank predictive process approximation of u with residual process ^ = uj — u. 
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(i) IfT C [a^Vf and there is a finite constant c > such thatYar{uj{s)—uj(t)} < c^||s— 
s,t € T then 



e2 



P(sup|e(t)| >ej <3exp(^--^ ) , Ve>0 (1) 



with B = 27^2pc(6 — a) and k? = sup^gj^ Var{^(t)}. 
(a) For any finite subset S C T 

p(snpm\>e) <3e^p\- f |, Ve > (2) 

KtGS J [ 94(2 + log|5|) J 

where \S\ denotes the size of S and k| = sup^g^ Var{,^(t)}. 

A proof is given in Appendix [Aj Note that the constant B does not depend on the 
number or locations of the knots, it only depends on the dimensionality and size of T as well 
as smoothness properties of the covariance function oj. It is possible to replace k in ^ with 
^2{i-r;) £qj, g^j^y aj-iji^j-^j-y jy £ ^0, 1), but with a different constant B. While provides an 
accuracy bound over the entire domain T, the bound in ^ over a finite subset maybe of 
more practical value. 

For Gaussian process regression models with additive Gaussian noise, a common modifica- 



tion ( Finley et al. , 2009 ) of predictive process approximation is to replace lo with the process 



9 = 1' + ^* where ^* is a zero mean Gaussian process, independent of u and ^, satisfying, 
rnv/;^*m A*r.U - / Cov{e(t),e(s)} = VarC(t) if t = s 

The addition of ^* gives D the same pointwise mean and variance as those of uj, without 
adding to the computational cost. The residual process is now given hy £^=uj — D = S^ — ^* 
whose variance equals 2Var{^(t)} because of independence between ^ and Because ^* is 
almost surely discontinuous, the bound in |l| does not apply to ^. But ^ continues to hold 
with Kg replaced by = 2k|. 



2.3 Need for adaptation 

For predictive process approximations, choosing the number and the locations of the knots 
remains a difficult problem. Ideally this choice should adapt to the canonical metric p, so 
that a small 6 obtains with as few knots as possible. However, it's not a single p that we 
need to adapt to. In modern Gaussian process applications, the covariance function and 
consequently the canonical metric depend on additional model parameters. A typical example 
is UJ of the form 

Uj{t) = UJo{t) + Xl{t)uJi{t) + + Xp{t)u}p{t) 

where Xj{tys are known, fixed functions and Wj's are independent mean zero Gaussian pro- 
cesses with covariances i/jjit, s) = tJ exp(— — with 6 = (tq, ri, • • • , r^, /3o, Pir • • , l^p) 
serving as a model parameter. Because a likelihood based model fitting will loop through 
hundreds or even thousands of values of 6, it is important to have a low-cost algorithm to 
choose the knots that automatically adapts to the geometry of any arbitrary canonical metric. 
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Such an adaptive feature is lacking from existing knots selection approaches which primar- 
ily treat knots as additional model parameters. The knots are then learned along with other 
model parameters, either via optimization (Snelson and Ghahramani 2006) or by Markov 



chain sampling (Tokdar, 2007; Guhaniyogi et al. , 2010). Another popular approach is to 



work with a fixed set of knots based on space-filling design concepts ( Zhu and Stein , 2005 

Among these, the proposal in Finley et al. (2009 



Zimmerman 


2006 


Finley et al. 


2009 



overlaps with our proposal. But while we pursue a low-cost adaptation at every value of 9 
at which likelihood evaluation is needed, Finley et al. (2009) consider one fixed set of knots 
adapted to a representative value of 6. Their knot selection algorithm has 0{N'^m) comput- 
ing time, which makes it infeasible to run within an optimization or a Markov chain sampler 
loop. 



3 Adaptative predictive process 

3.1 Predictive process approximation via Cholesky factorization 

Let Lo = {uj{t),t G T) be a zero-mean Gaussian process with covariance ip{s,t). Suppose the 
finite set S = {si, S2, • • • , SAf} C T contains all points in T where uj needs to be evaluated 
for the purpose of model fitting. Let 9 = ((V'ij)) denote the N x N covariance matrix of 
the Gaussian vector W = ,uj{siy))- A Cholesky factor A of with A being a 

N X N upper triangular matrix with non-negative diagonal elements and satisfying ^ = A'A, 
obtains from the following recursive calculations 

An = JV'n-E^'.' X,, = '^'^~^^<'^'''^'\ t = l,.-- ,N,j = i + l,.-- ,N (3) 



with Xij, j < i set to zero. This gives a row-by-row construction of A and requires Cm^^"^ 
computation time for constructing the first i rows for some machine dependent constant Cm- 
For an m G {1, • • • , A^}, an approximation A = {{Xij)) to A obtains in 0{Nm?) time by 
an incomplete application of the above recursion. The first m rows of A are constructed in 
CmNvo^ time through (p|: 



\^= U^^-Y.{^UY. X,^ = 'h^^fd^^ i = 1, • • • , m, J = Z + 1, • • • , A^. (4) 

For the remaining rows, only the diagonal elements are computed in Cm{N — m) time as in 
the first part of ([s]), with the off diagonals set to zero: 



-JZ^^^^i^^, Aij=0, i = m + l,--- ,N,j = i + l,--- ,N. (5) 

e<m 

The lower triangular elements Xij, j < i, are all set to 0. The resulting A is upper triangular 
with non- negative diagonals and equals the Cholesky factor of the covariance matrix ^' of ^ = 
(z^(si), • • • , u{s]\f)) where u = {u{t),t G T) is the Gaussian predictive process approximation of 
UJ based on knots si, • • • , Sm- The resulting residual process ^ = uj — i' satisfies Var{^(si)} = 0, 
1 < i < m and Var{^(si)} = {Xa}'^, i = m + 1, ■ ■ ■ , N, and hence k| := max^es Var{^(s)} = 
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maxj>m A?j. From controls error bounds P{maXs£S — i'{s)\ > e) over the set of 

interest S. 

Therefore, the above incomplete Cholesky factorization produces a Gaussian predictive 
process approximation, with readily available error bounds, provided we are happy to choose 
the knots from the set 5. The restriction to S appears reasonable for most applications with 
the additional burden on the modeler to identify S carefully. For example, in a Gaussian 
process regression model with additive noise, it is sufficient to take S to be the training set 
of covariate values, if only posterior predictive mean and variances are needed at test cases. 
But if posterior predictive covariance between two test cases, or a test and a training case is 
required, then S should include these test cases as well. 



3.2 Pivoting and dynamic stopping 

Our quest of an adaptive version of stays within this restriction, but employs a dynamic 
choice of the stopping time m and the order in which the elements of S are processed. To 
decide whether the current stopping time is acceptable, we check the current ns against a 
given tolerance Kto|. If ks exceeds the tolerance, we increment m to m + 1, and repeat Q 
only for i equal to the new value of m, followed by ([s]), producing an update of ks- The top 
row elements i < m need no changes. 

The increment of m and the subsequent alterations to A clearly reduce ks as the tailing 
Ajj's in ^ are reduced. This reduction can be expected to improve if after incrementing m 
and before proceeding with the new calculations, one swaps the current m-th and fc-th rows 
of ^ where i = k gives the maximum of the tailing Xa values, i = m, ■ ■ ■ , N . A sequence 
of such swaps, from start to the terminating m, gives a greedy, dynamic approximation to 
finding the optimal ordering of the elements of S that gives a Kg < Ktoi with a minimum 
stopping time m. 

The dynamic swapping is a common feature, known as pivoting, of all leading software 
packages for Cholesky factorization. If run until m = N, pivoting produces a permutation 
TT = (vTi, • • • , ttn) of (1, • • • , A^) and an upper triangular matrix A with non-negative diag- 
onals such that Pj^^P^ = A'A where Pjr is the N x N permutation matrix associated with 
TT. Our proposal above simply adds to this pivoted Cholesky factorization a dynamic, tol- 
erance based stopping. The resulting A gives the Cholesky factor of the covariance matrix 
of F = (z^(si),-- - ,z^(siv)) where ly is the Gaussian predictive process associated with the 
knots , • ■ ■ 7 ^TTm • The Gaussian predictive process u comes with the error bound ^ with 
Ktoi replacing k. The additional computing time needed for pivoting is only 0{Nm), a small 
fraction of the computing time 0{Nw?') needed to get the elements A if tt was precomputed. 

Algorithm[T]gives a pseudo code for performing the incomplete Cholesky factorization with 
an additional improvisation. The user specified tolerance Ktoi is taken to be a relative tolerance 
level instead of an absolute one. The absolute tolerance is fixed as Ktoi times the maximum 
of \ai{uj{si)}'^^'^ , i = 1, - ■ ■ ,N. This makes sense for Gaussian process approximation as the 
maximum variance of the process can be viewed as a scaling parameter. 



3.3 Geometric Illustration 



We illustrate the adaptive choice of knots on the Bartlett experimental forest dataset (Finley 



et al. 2009 ). This dataset contain n = 437 well identified forest locations si, • • • ,Sn, measured 
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Algorithm 1 Pivoted, incomplete Cholesky factorization with dynamic stopping. 
Require: A covariance function •), positive integers N and mmax and tolerance Ktoi > 0. 

R OmmaxXmmax 

TTi ^ 1, 7r2 2, • • • , TTAT A/" 
k ^ 1 

^ argmaxi<Kiv ip{sni,Sni) 

max'^tol 

while dmax > K^^i do 
swap TTfc and 7r(/max) 

,1/2 

fkk ^ Umax 

for _7 = A; + 1 to m do 

rkj ^ {i^{s 
end for 
A; A; + 1 

dmax ^ maXk<l<N V'(s7rp SttJ - Y.l<k ^fk 
^max ^ argmaXjt<KJvV'(s7rpS7r;) - J^Kk^lk 

end while 

m A; 

for A; = m + 1 to do 

rkk ^ {Hs^k^s^k) - J2i<mrLV^^ 
end for 

return Factor matrix R, pivot tt and rank m 
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as eastings (on the horizontal axis) and northings (vertical axis) from a reference point (Figure 
[T]). For illustration purposes, we consider only a hypothetical Gaussian process model where 
a surface uj{s) over the forest area is modeled as a zero-mean Gaussian process with a square- 
exponential covariance function 

^{s,t) = x{s)x{t)eM-m{s - t)f), (6) 

for some constant /3 > 0, an orthogonal projection matrix Q and some fixed function x{t) 
that relates to the slope of the forest landscape at location t. The variance of w(t) is x(t)^ 
and the correlation between u;(s) and uj{t) depends on the Euclidean distance between the 
Q- projections of these two location vectors. The parameter /3 > encodes the spatial range 
of the covariance, i.e., it controls how rapidly ip{s,t) decays with the distance between s and 
t. We demonstrate how the choice of knots according to Algorithm [T] adapts to variations in 
each of /3, x{t) and Q. 

Figure 1(a) shows nine choices of ([g]) that vary in Q and /3, while x{t) and Ktoi are held 



fixed at x{t) = 1 and k^^i = 10~^. The top row has /3 = 10~^, the middle row has /3 = 5 x 10" 
and the bottom row has /3 = 10~^. The left column has Q equal to the identity matrix, the 
middle column has Q that projects along the horizontal axis and the right column has Q that 
projects along the vertical axis (indicated by arrows). It is clear that the algorithm picks 
fewer knots as the spatial range decreases. A smaller spatial range gives a flatter topology 
in the canonical metric, consequently fewer points are required to capture the variation in 
the random surface cj(t). It is also clear that the algorithm adapts to the directional element 
of this topology. It lines up the knots along the horizontal axis when Q is the horizontal 
projection and lines up the knots along the vertical axis when Q projects along that axis. 
Figure 1(b) shows nine other choices of ^ that vary in Q and x{t), while /3 and k^oi 



are held fixed at /3 = 10~^ and k,^^^ = 10~^. Variation in Q is as in Figure 1(a) We take 
x{t) = 1 — F{c - slope(t)|a, b), where F{x\a, b) denotes the gamma distribution function with 
shape parameter a and rate parameter b, and slope(t) is the slope of the landscape at point 
t. We fix a and b so that the mean and variance of the gamma distribution match the mean 



and variance of the recorded slope values at the 437 locations. The top row of Figure |l(b) 



has c = 0, the middle row has c = 1 and the bottom row has c = 6. Larger values of c 
make x{t) closer to zero at regions with a high slope while x{t) always equals 1 at regions 
that are flat (shown in lighter color, mostly along a narrow valley running from south-east 
to north-west). With a larger c, most of the variation in uj{t) is conflned to locations t with 
a flat slope. It is clear that our algorithm adapts to this feature by picking knots from such 
areas. 

Figure 1(c) shows (pi) with /3 = 10~^, x{t) = 1, Q = the identity matrix, but with four 



choices of K^j = 10~^, 10~^, 10~^ and 10~^ (clockwise from top left). For smaller tolerance 
levels, more knots are picked to meet a tighter accuracy condition. A good coverage of the 
entire region is maintained throughout, but additional knots are chosen to give a denser 
representation. Note the higher concentration of knots at the boundary than the interior. 
This is a consequence of the greedy nature of the algorithm as it tries to pick the next knot as 
the point that is least correlated with the ones already selected. Although a better algorithm 
could correct for such a boundary bias, the linear additional computing cost of the greedy 



search offers a highly attractive trade-off against a few extra knots. Figure 1(d) indicate that 



the number of knots m required to meet the tolerance criterion grows at a logarithmic rate 
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Figure 1: Geometry of knot selection illustrated on Bartlett experimental forest data, (a) 
Adaptation of knots to changes in Q (columns) and (3 (rows), (b) The same for changes in 
Q (columns) and x{s) (rows), (c) The same for changes in Ktoi. (d) Number of knots needed 
to meet specified accuracy bounds for a particular choice of the covariance. 
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with I/kjoi- However, theoretical results are not yet available on the relationship between m 
and Ktoi- 

4 Application to Bayesian computation 

4.1 Sparse nonparametric regression 

A low cost, adaptive choice of knots is extremely beneficial in Bayesian Markov chain Monte 



Carlo (Robert and Casella, 2004 Gilks et al. , 1995 ) computations, where likelihood evaluation 



is needed at thousands of different values of the covariance parameters. We illustrate this 
with a sparse non-parametric regression model, where a good exploration of the space of 
covariance parameters is critical to model fitting. We show that to efficiently explore the 
covariance parameter space, it is important to adapt to the changes in the canonical metric 
caused by the changes in these parameter values. In this regard, the adaptive predictive 
process approximation proposed here has a clear advantage over the existing approaches of 
handling knots. 

We consider a toy data set consisting of {xi,yi), i = I,-- - ,n = 10,000, where Xj = 
(xji, • • • , Xip)' are drawn independently from the uniform distribution over [0, 1]^, with p = 10, 
and Ui are generated independently as yi ~ A^(2 sin{27rxji}, 0.1^). We consider a regression 
model 

Vi = ^1 + Tuj{xi) + Tei, ei ™ iV(0,fT^), (7) 

with u; modeled as a zero-mean Gaussian process over T = [0, 1]^. The covariance function 
of (jo is taken to be 

p 

ijis, t) = m = exp{- ^ f3]isj - tjf}, t,s£T. 

For simplicity of exposition, we set /j. and at the mean and variance of the observed yi 
values, and focus on learning the parameters /3i,-- - ,/3p and cr^. The 13 j parameters are 
assigned standard normal prior distributions, folded onto the positive real line and logo"^ 
is assigned a standard normal prior distribution. Prior independence across parameters is 
assumed. 

A fixed set of knots over T is clearly infeasible for this application due to the dimension- 
ality of T. Placing only 3 knots along each axis takes the total count to 3^° = 59049. While 
the alternative approach of placing an auxiliary model on the knots and learning them jointly 
with the covariance parameters via Markov chain sampling can drastically reduce the total 
number of knots, it is likely to lead to a poor exploration of the covariance parameters for 
the following reason. 

Consider a Gibbs update for /3i given the remaining parameters and the knots. Suppose 
the current parameter values are /3i = /32 = 0.1, Ps = ■ ■ ■ = f3p = 0.004 and = exp(— 3.5). 
For these parameter values, a "well learned" choice of the knots is found by applying Al- 
gorithm [l] to tl^{s,t\P) with f3 set at the vector of current values (we use k^^, = 10~^). 
The corresponding posterior conditional density p{(3i\f32, • • • , f3p, cr^, knots, data) is shown by 
the solid curve in Figure [2] If, instead, the current values had been /3i = 1, /32 = 0.1, 

= ■ ■ ■ = f3p = 0.004, fj^ = exp(— 3.5) and the knots were chosen by applying Al- 
gorithm [l] to the corresponding Tp{s,t\P), then the resulting posterior conditional density 
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Figure 2: Conditional posterior density of /3i, in the non-parametric regression model, given 
other parameters and the knots, for two different choices of the knots. 



p{Pi\h: ■ ■ ■ 1 <7^, knots, data) would look like the dashed curve in Figure[2| The two curves 
are very different even though the values of the conditioning parameters /32 , • • • , /3p and 
are the same. This difference shows that the conditional distribution of the covariance pa- 
rameters strongly depends on the current set of knots, which, if well learned, should depend 
on the current values of the covariance parameters. Consequently, there will be additional 
stickiness in the chain of sampled values of the covariance parameters. 

The proposed adaptive predictive process gets rid of this additional stickiness by auto- 
matically adapting the knots to the covariance parameter values. As discussed before, for 
this application it is reasonable to restrict the search of knots to the set of observed covariate 
vectors S = {xi, • • • ,Xn}- Then, one only needs to specify a tolerance level Ktoi to obtain 
an approximation p{y\j3,a'^) of the marginal likelihood p{y\l3,a'^) = J p{y\uj,a'^)p{duj\l3) of 
{I3,a'^). The approximation obtains by replacing uj in the integral with its predictive process 
approximation adapted to the corresponding ip{s,t\f3). The approximate marginal likeli- 
hood can be computed in 0{nm?) time, where m is the number of knots needed for il){s, t\j3) 
with the given tolerance level; detailed formulas are available in |Snelson and Ghahramaiu] 
(2006). The approximate posterior density a'^\data) oc p(?/|/3, a'^)p{(3, cr^) can be explored 
with common Metropolis-Hastings samplers. Figure [3] reports summaries from a random walk 
Metropolis sampler exploration with a tolerance level k^^i = 10~^. 

4.2 Varying coefficient regression for spatial data 



Pace and Barry (1997) use county level data from 1980 United States presidential election to 



relate voter turnout to education, income and homeownership standards. They use a spatial 
autoregressive model to allow this relation vary geographically. We pursue an alternative 
formulation with a Gaussian process spatial regression model. 
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Figure 3: A random walk Metropolis sampler exploration of the posterior for the non- 
parametric regression problem and associated posterior summaries, (a) Trace plots of the 
sampled values of /3i, • • • , (3iq, a. (b) Histograms of the same, (c) Posterior median (solid line) 
and 95% credible intervals (dashed lines) for /(x) at x = (z/100, 0, ••• ,0), i = 0, ••• ,100, 
overlaid on the data scatter along xi and y. (d) Values of m along the run of the sampler, 
shown at every 100th sweep. 
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For county i = I,-- - ,n, let Vi, Pi, Ei, li and Hi denote, respectively, voter count, 
population size of age 18 years or more, population size with at least high school education, 
aggregate income and number of owner occupied housing units. We take log-percentage 
voter count m = log{Vi/Pi) as the response variable. Three predictor variables are defined as 
xii = log{Ei/Pi), X2i = \og{Ii/Pi) and xu = \og{Hi/Pi). 

We relate the response to the predictors through a spatially varying regression model 

yi = ujQi + xiiUii + X2iUJ2i + x^iiuj^i + ei, ™ iV(0, a^). (8) 

To ensure that geographically proximate counties have similar coefficients, the vector of 
coefficients from all counties (wji,--- ,a;j„), for each j = 0,1,2,3, is taken to be a zero 
mean multivariate normal with Co\{ujji,ujjk) = exp{— — where ti is the spatial 

location vector of county i, given by the latitude- longitude pair of the county centroid. These 
four vectors are taken to be mutually independent. 
The above formulation is equivalent to 

= ujiU) + ei, ei"r5N{0,a^) (9) 

where uj{t) is a zero-mean Gaussian process on T = {ti,--- with covariance function 
'tlj{s,t\9) = Yl%(}Xji't)xj{s)exp{-l3'j\\s - t\\'^},s,t £ T, with xo{ti) = 1, Xj{ti) = xji, etc, 
that depends upon the parameter vector 9 = (tq, n, r2, T3, /3o, /3i, /32, /Ss). These covariance 
parameters are each assigned a standard normal prior distribution, folded onto the positive 
half of the real line, independently of each other. We also assign logcr^ a standard normal 
prior distribution, and log is taken to be a priori independent of 9. 

We fit this model using 1040 randomly chosen counties, roughly one third of the total 
count. Figure |4] shows a random walk Metropolis sampler exploration of the approximate 
posterior density p{0,cr'^\y) oc p{y\9 , a'^)p{9)p{a'^) with k^^, = 10~^ and S taken to be the set 



of locations for the counties included in model fitting. Trace plots in Figure 4(a) indicate good 
convergence of the sampler. The marginal posterior distributions of the model parameters 
appear unimodal (Figure [4(b) [ ). Figure 4(c) shows knots locations from two different sweeps 



of the sampler. It is interesting that the knots are not uniformly distributed over the whole 
country. The sampled values of m, shown in Figure [4(d)| indicate that unlike the regression 
example of the previous section a substantial fraction (~ 15%-35%) of points from S are 
needed to meet the accuracy bound. 

Figure [5] shows summaries of our model fit. Figure 5(a) shows the percentage voter turnout 



(bottom) and the predicted percentage turnout (top) for all 3106 counties. The predicted 
value for county i is calculated as the Monte Carlo approximation to exp[E{u;(tj) | data}]. Fig- 
ure [5(b)] combines the values from Figure 5(a) into a scatter plot, split by counties included in 



model fitting (green dots) and the remaining ones (red dots). Figure 5(c) shows the spatially 



varyin g regr ession coefficients E{ujji\data} on predictors Xji, j = 1,2,3, for all counties i. 



Figure 5(d) shows the effect size of these coefficients, found by E{ujji\data} / y^Varj j | data} . 

From Figure [5] we see that predictor x^, which relates to home ownership, has a strong, 
spatially varying infiuence on voter turnout. This predictor has a positive coefficient for 
all counties, with larger values and effect sizes for counties in the southeast. Predictor xi, 
which relates to education, has a moderate, spatially varying infiuence, with about 1/3 of 
the counties having an effect size larger than 2. It is interesting that coefficients of xi and 
X3 appear to be inversely related to each other. Predictor X2, which relates to income, has 
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Figure 4: A random walk Metropolis exploration of the posterior for the election turnout 
analysis, (a) Trace plots of model parameters, (b) Histograms of the same, (c) Knot 
locations at two distant sweeps of the sampler, (d) Values of m along the sampler. 
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Figure 5: Posterior summary for the election turnout analysis, (a) Observed (bottom) and 
predicted (top) percentage turnout shown by counties, (b) Scatter plot of the same, green 
dots in the background mark training data while the red dots in the foreground are for held 
out data, (c) Estimated coefficients for the three predictors shown by counties, (d) Effect 
sizes of the three predictors shown by counties. 
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a weak influence, with effect size less than 2 in absolute value for more than 95% of the 
counties. However, the spatial variation of the coefficient of X2 is more intricate and wavy 
than that of the other two predictors. Although we do not try to interpret these variations, 
we note that spatial regression model indeed provides a better fit than an ordinary linear 
regression that relates y to xi, X2 and X3. The root mean square prediction error from the 
ordinary linear regression, calculated over the counties not included in model fitting, equals 
0.14. The same statistic for the spatial regression model is 0.11. 



5 Discussion 



We have addressed the question of knot selection within the predictive process methodology 
and have offered proposals that can substantially automatize the implementation of Gaus- 
sian process models in Bayesian analysis. A key conceptual contribution of our work is the 
emphasis on error bounds to derive a finite rank approximation of an infinite dimensional 
Gaussian process. It must be noted that the accuracy bounds we provide are all a priori. 
That is, we can not provide an accuracy bound on how well the posterior distribution un- 
der the predictive process model approximates the posterior distribution under the original 



Gaussian process model. However, Tokdar (2007) provides some useful theoretical calculation 
toward this. 

We note that approaches that use an auxiliary model on knots are fundamentally different 
from our deterministic choice of knots driven by accuracy bounds. Our approach clearly stays 
within the limits of an approximating method. The smaller the specified tolerance, the closer 
we are to the original Gaussian process. Approaches with a model on the knots essentially 
define a different stochastic process. The new stochastic process could indeed provide a better 



model for the given task, as discussed in Snelson and Ghahramani (2006). However, it would 



be erroneous to assume that the theoretical properties of a Gaussian process model would 
also apply to this new stochastic process model. 

It is important to realize that at the crux of a predictive process approximation is the rank 
deficiency of the covariance matrix of the Gaussian vector W = (a;(si), • • • ,uj{sn))-, which is 
a manifestation of the underlying smoothness of the process u. For Gaussian process models 
that employ a relatively un-smooth uj, such as spatial autoregressive models for lattice data 



(Besag 1974 Rue et al. , 2009), predictive process may not be the ideal approximating tool. 



Indeed, in such cases, the covariance matrix of W need not be rank deficient, but can have 
special banded structures that allow for other approximation techniques to yield 0{Nm?) 
computing. 

For a smooth w, the rank deficiency of the covariance matrix of W poses an additional 
problem. The covariance matrix, irrespective of its size, can be ill conditioned, making numer- 
ical computations unstable. The dynamic, tolerance based stopping of the adaptive predictive 
process can solve this problem to a large extent, because the knot finding algorithm is likely 



to terminate before the covariance matrix of (a;(s7ri), 



, W St, 



J) becomes ill conditioned. 



A Technical details 



Proof of Theorem 2.1. Let ^(x) = ie^' , x > 0. '^{x) is increasing and convex with \E'(0) G 



(0,1). For any random variable Z its ^'-Orlicz norm ( [Pollard 1990 page 3) is defined as 
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:= inf{C > : E^'dZl/C) < 1}. Such a norm provides bounds on tail probabilities 
as follows: P(|Zj > x) < l/^'(x/||Z||vi,) = 3exp(-xV||Z|||) for all x > 0. This immedi- 
ately leads to and Q once we show || sup^gj- < Bk}/"^ and || sup^g^ < 
3K5Vlog(2 + log|5|). 



(i) Lemma 3.4 of Pollard ( 1990 ) states that for any to £ T, 



A A 

sup m\h < mto)h + E ^ V 2 + ^osDi-,T, d) 

< U{to)U + 9 / Vlog D{e,T,d)de 
Jo 



(10) 
(11) 



where D{e, T, d) is the e-packing number of T under a (pseudo) metric d{s, t) with 
A = supgfd{s,t), provided ||^(s) — C(*)ll* < d{s,t) for all s,t G T. It is easy to 
calculate that \\Z\\^ = 1.5 if Z ~ iV(0, 1) and that ||Z||vi> = 1.5a if Z ~ iV(0,of). 
Therefore, for any s,t e T, \\^{s) - ^{t)\\^ = 1.5[Var{C(s) - ^(t)}]^/^ Therefore (|lO) 
holds if we take d{s,t) = 1.5[Var{C(s) - C{t)}V^^- 

To calculate the right hand side of ( (Tol ), fix to to be any of the knots used in defining u. 
Then ■^(to) = and consequently the first term ||C(io)||<i' = 0. To calculate the integral 
in the second term, furst note that 

d{s,tf = 2.25Var{^(s) - ^(t)} < 2.25Var{a;(s) - uj{t)} < 2.25c2||s-tf 

where the first inequality holds because u = u + with u and independent, and 
the second inequality follows from our assumption on uj. Therefore D{e,T,d) < {1 + 
1.5c(6 — a)/e}'^. Next, bound the diameter A as follows 

A^ = 2.25 sup Var{^(s) - ^(t)} < 2.25 sup 2[Var{^(s)} + Var{^(t)}] < 9^^. (12) 

s,t<=T s,t£T 



Now use log(l + x) < X to bound the right hand side of ( 10 ) by 

9 / \/plog{l + 1.5c(6-a)/e}de < 9y/l.5pc{b - a) / e^^^^de = Bk^/'^ 
Jo Jo 

as desired. 

(ii) Now, to calculate || sup^g^ |^(t)||4i, note that the condition of Lemma 3.4 of iPollard 



T77 



( |1990[ ) is trivially satisfied with d{s,t) = \\({s) - C(i)lk = 1.5[Var{^(s) - ^(t)}] 
due to discreteness of S and D{e,T,d) < \S\ for all e > 0. Therefore we can apply 
(10) with S instead of T to conclude || sup^g^ |^(t)| ||>ii < Ay^2 + log \ S\ where A^ = 
sup g ffzs d{s,t)'^ < 9k| as in (12). 



□ 
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